Bayesian Linear Regression

Analysis of Flight Delay Data

Author

Sara Parrish (Advisor: Dr. Seals)

Published

Nov 13, 2024

Slides: slides.html

1 Introduction

Bayesian inference was initially developed by Reverend Thomas Bayes, but his ruminations on inverse probability wouldn’t be known until a friend polished and submitted his work to the Royal Society. Bayes’ work was eventually developed and refined by Laplace. Bayesian inference was wildly different from Fisher’s work in defining classical statistical theory (Lesaffre and Lawson 2012).

In opposition to the Bayesian approach is the frequentist approach. The frequentist approach considers the parameter of interest fixed and inference on the parameter is based on the result of repeated sampling. In the Bayesian approach, the parameter of interest is not fixed but stochastic, and inference on the parameter is based on observed data and prior knowledge (Lesaffre and Lawson 2012).

A benefit of the Bayesian approach lies in the ability to include prior knowledge through the selection of a prior. Priors can be subjective or objective. Subjective priors incorporate the opinions of an individual or a group, which can negatively impact the perceived integrity of the findings. Objective priors, which follow formal rules for determining noninformative priors, are therefore preferred (Lesaffre and Lawson 2012).

When prior knowledge is limited, uninformative, or otherwise insufficient, a noninformative prior may be used. A common choice for a noninformative prior, in cases with a binomial likelihood, is a uniform distribution, also known as a flat prior. In cases with a Gaussian likelihood, the noninformative prior would be a normal prior with \(\sigma^2 \to \infty\), which functions similarly to the flat prior. For cases with a Poisson likelihood, a Gamma(\(\alpha_0\), \(\beta_0\)) prior is used, where the sum of the counts is \((\alpha_0 - 1)\) and the experiment size is \(\beta_0\) (Lesaffre and Lawson 2012). For normal linear regression models, conjugate normal-gamma priors can be used to provide a resulting posterior that is of the same family (Yan and Su 2009).

There are a variety of ways to summarize the posterior in order to derive conclusions about the parameter. Its location and variability can be specified by finding the mode, mean, and median of the distribution. Its range can be defined with the equal-tail credible interval (not to be confused with the confidence interval in the frequentist approach) or with the highest posterior density (HPD) interval. Future predictions for the parameter can be made through posterior predictive distributions (PPD), which integrate out \(\theta\) under the assumption that past data will not influence future data given the parameter, known as hierarchical independence (Lesaffre and Lawson 2012).

It is not uncommon for the marginal posterior density function of a parameter to be unavailable, requiring an alternate approach to extract insight. It is safe to assert that sampling techniques are used in nearly all Bayesian analyses (Yan and Su 2009). General-purpose sampling algorithms include the inverse CDF method, the accept-reject algorithm, importance sampling, and the commonly used Monte Carlo integration, which replaces the integral of the posterior with the average of randomly sampled values to provide an expected value. Two popular Markov chain Monte Carlo (MCMC) methods are the Gibbs sampler and the Metropolis-Hastings (MH) algorithm (Lesaffre and Lawson 2012).
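To make Monte Carlo integration concrete, here is a minimal sketch in R assuming a hypothetical Beta(3, 9) posterior (not one of the flight-delay models): the posterior mean is approximated by averaging random draws.

```r
# Monte Carlo integration: replace the integral defining E(theta | y)
# with the average of random draws from the posterior.
# Assumed for illustration: a Beta(3, 9) posterior, whose exact mean
# is 3 / (3 + 9) = 0.25.
set.seed(123)
draws <- rbeta(100000, shape1 = 3, shape2 = 9)

mc_estimate <- mean(draws)  # Monte Carlo estimate of the posterior mean
mc_estimate                 # close to 0.25; error shrinks as draws increase
```

The same averaging idea underlies the MCMC summaries reported later, except that MCMC produces dependent draws from a Markov chain rather than independent ones.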

2 Methods

2.1 The Frequentist Framework

Linear regression can be achieved using a variety of methods; two of interest are the frequentist and Bayesian approaches. The frequentist approach to linear regression is the more familiar one. It estimates the effects of independent variables (predictors) on dependent variables (the outcome). The regression coefficient is a point estimate, assumed to be a fixed value. Following is the frequentist linear model

\[ Y = \beta_0 + \beta_1X + \varepsilon \]

  • \(Y\) : Dependent variable, the outcome
  • \(\beta_0\) : y-intercept
  • \(\beta_1\) : The regression coefficient
  • \(X\) : Independent variable
  • \(\varepsilon\) : Random error (Yan and Su 2009)
  • \(\hat\beta\) provides a point estimate

2.2 The Bayesian Framework

The Bayesian approach estimates the relationship between predictors and an outcome in a similar way; however, its regression coefficient is not a point estimate but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further than frequentist regression in its inclusion of prior data. The Bayesian approach is so named because it is based on Bayes’ rule, which is written as follows:

\[ Posterior = \frac{Likelihood \times Prior}{Normalization} \]

  • The \(Prior\) is a model of prior knowledge on the subject
  • The \(Likelihood\) is the probability of the observed data given the model parameters
  • The \(Normalization\) is a constant that ensures the posterior distribution is a valid density function whose integral is equal to 1
  • The \(Posterior\) is the probability model that expresses an updated view of the model parameters, starting from the initial parameters of the prior

In terms of calculating probability, Bayes’ rule can be written as

\[ p(B|A) = \frac{p(A|B)\cdot p(B)}{p(A)} \]

  • Bayes’ rule allows for the calculation of inverse probability (\(p(B|A) \text{ from } p(A|B)\))
    • \(p(B|A) \text{ and } p(A|B)\) are conditional probabilities
    • \(p(A) \text{ and } p(B)\) are marginal probabilities (Lesaffre and Lawson 2012)
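As a quick worked example (the probabilities are made up purely for illustration), the inversion can be computed directly:

```r
# Hypothetical probabilities, assumed only for illustration
p_A_given_B <- 0.9   # conditional probability p(A|B)
p_B         <- 0.1   # marginal probability p(B)
p_A         <- 0.15  # marginal probability p(A)

# Bayes' rule recovers the inverse probability p(B|A) from p(A|B)
p_B_given_A <- p_A_given_B * p_B / p_A
p_B_given_A  # 0.6
```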

For calculating the probability of continuous parameters, Bayes’ rule can be applied as

\[ \begin{align*} p(\theta|y) =& \frac{ L(\theta|y)p(\theta) }{p(y)}\\ \\ p(\theta|y) \propto & \text{ }L(\theta|y)p(\theta) \end{align*} \]

The normalization constant (\(p(y)\) above) ensures the posterior distribution is a valid distribution, but the posterior density function can be written without this constant. The resulting prediction is not a point estimate, but a distribution. The Bayesian approach is derived with Bayes’ theorem, wherein the posterior distribution, the updated belief about the parameter given the data \(p(\theta|y)\), is proportional to the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The former is known as the likelihood function and comprises the new data for analysis, while the latter allows for the incorporation of prior knowledge regarding \(\theta\) (Yan and Su 2009).

\[ f(\theta|D) \propto L(\theta|D)f(\theta) \]
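This proportionality can be demonstrated numerically with a grid approximation, a minimal sketch assuming a hypothetical \(N(0, 1)\) prior and a single observation \(y = 1.5\) with a \(N(\theta, 1)\) likelihood (not the flight data): multiplying likelihood by prior pointwise and renormalizing yields a valid posterior.

```r
# Grid approximation of p(theta | y) proportional to L(theta | y) p(theta).
theta <- seq(-4, 4, length.out = 1001)  # grid of candidate parameter values
y     <- 1.5                            # a single hypothetical observation

prior      <- dnorm(theta, mean = 0, sd = 1)  # prior density at each grid point
likelihood <- dnorm(y, mean = theta, sd = 1)  # likelihood of y at each theta

unnormalized <- likelihood * prior                # proportional to the posterior
posterior    <- unnormalized / sum(unnormalized) # renormalize to total mass 1

sum(theta * posterior)  # posterior mean; conjugacy gives exactly 0.75 here
```

With conjugate normal prior and likelihood the posterior is \(N(0.75, 0.5)\), so the grid result can be checked against the closed form.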

2.3 The Models

To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and include the mean specific to our predictor, departure time, \(\mu_i\). The model is

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \end{align*} \]

Where:

  • \(Y_i\) is the arrival delay for the i-th flight
  • \(X_i\) is the departure time for the i-th flight
  • \(\mu_i = \beta_0 + \beta_1X_i\) is the local mean arrival delay, specific to the departure time
  • \(\sigma^2\) is the variance of the errors
  • \(\overset{\text{ind}}{\sim}\) indicates conditional independence of each arrival delay with the given parameters

2.4 Prior Selection

Since we are only using two continuous variables for the first model, arrival delay and departure time, the regression parameters will be \(\beta_0\), \(\beta_1\), and \(\sigma\) for intercept, slope, and error. As the intercept and slope regression parameters can take any real value, we will use normal prior models (Johnson, Ott, and Dogucu 2022).

\[ \begin{align*} \beta_0 &\sim N(m_0, s^2_0)\\ \beta_1 &\sim N(m_1, s^2_1) \end{align*} \]

where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.

The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).

\[ \sigma \sim \text{Exp}(l) \]

Because the exponential model is a special case of the Gamma model, with shape parameter equal to 1, we can use the definitions of the mean and variance of the gamma model to find those of the exponential model (Johnson, Ott, and Dogucu 2022).

\[ \begin{align*} E(\sigma) = \frac{1}{l} && \text{and} && SD(\sigma) = \frac{1}{l} \end{align*} \]
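These identities can be checked by simulation; this is a sketch with an arbitrary rate \(l = 0.1\) chosen only for the check:

```r
# For sigma ~ Exp(l), both E(sigma) and SD(sigma) equal 1/l.
# The rate l = 0.1 is arbitrary, used only for this simulation check.
set.seed(123)
l <- 0.1
sigma_draws <- rexp(1e6, rate = l)

mean(sigma_draws)  # close to 1/l = 10
sd(sigma_draws)    # also close to 10
```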

2.5 Tuning of Priors

Tip

Flat vs default vs tuned priors

2.6 The Bayesian Linear Regression Model

The Normal-normal regression model with a continuous predictor can be written as

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

2.7 Assumptions

Tip

“Normal regression assumptions

The appropriateness of the Bayesian Normal regression model (9.3) depends upon the following assumptions.

  • Structure of the data
    • Accounting for predictor \(X\), the observed data \(Y_i\) on case \(i\) is independent of the observed data on any other case \(j\).
  • Structure of the relationship
    • The typical \(Y\) outcome can be written as a linear function of predictor \(X\), \(\mu = \beta_0 + \beta_1X\).
  • Structure of the variability
    • At any value of predictor \(X\), the observed values of \(Y\) will vary normally around their average \(\mu\) with consistent standard deviation \(\sigma\).”

2.8 Statistical Programming

Data was collected by the Bureau of Transportation Statistics (BTS) and accessed through a dataset compiled by Patrick Zelazko (Zelazko 2023). The data was imported into R (R Core Team 2023) via CSV. This is a large time-series dataset with 3 million observations, each a specific flight, and 32 features. The data covers flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the delay times in minutes and the attributed reasons for delay. The stan_glm() function from the “rstanarm” library (Brilleman et al. 2018) was used for simulation of the Normal Bayesian linear regression model. This function also runs the Markov chain Monte Carlo simulation with specified chains and iterations, and allows a seed to be set; these were set to 4 chains, 2000 iterations, and a seed of 123. Simulation of the posterior was done with the posterior_predict() function, also from the “rstanarm” library (Brilleman et al. 2018). Evaluation of the model was done by considering the data and its source, the assumptions of the model, and the accuracy of the predictions. The posterior predictions were evaluated with the prediction_summary() function from the “bayesrules” library (Dogucu, Johnson, and Ott 2021), which provides the median absolute error (MAE), scaled MAE, and the proportion of observed values that fall within 50% and 95% posterior prediction intervals.

Tip

Possibly k-fold cross validation, model averaging

3 Analysis

3.1 The Dataset

This is a large dataset with 3 million observations, each a specific flight, and 32 features. The data covers flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the delay times in minutes and the attributed reasons for delay. Following are the definitions of the given variables in this dataset.

  • Fl Date: Flight Date (yyyy-mm-dd)
  • Airline: Airline Name
  • Airline DOT: Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
  • Airline Code: Unique Carrier Code
  • DOT Code: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
  • Fl Number: Flight Number
  • Origin: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
  • Origin City: Origin City Name, State Code
  • Dest: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
  • Dest City: Destination City Name, State Code
  • CRS Dep Time: CRS Departure Time (local time: hhmm)
  • Dep Time: Actual Departure Time (local time: hhmm)
  • Dep Delay: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
  • Taxi Out: Taxi Out Time, in Minutes
  • Wheels Off: Wheels Off Time (local time: hhmm)
  • Wheels On: Wheels On Time (local time: hhmm)
  • Taxi In: Taxi In Time, in Minutes
  • CRS Arr Time: CRS Arrival Time (local time: hhmm)
  • Arr Time: Actual Arrival Time (local time: hhmm)
  • Arr Delay: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
  • Cancelled: Cancelled Flight Indicator (1=Yes)
  • Cancellation Code: Specifies The Reason For Cancellation
  • Diverted: Diverted Flight Indicator (1=Yes)
  • CRS Elapsed Time: CRS Elapsed Time of Flight, in Minutes
  • Actual Elapsed Time: Elapsed Time of Flight, in Minutes
  • Air Time: Flight Time, in Minutes
  • Distance: Distance between airports (miles)
  • Carrier Delay: Carrier Delay, in Minutes
  • Weather Delay: Weather Delay, in Minutes
  • NAS Delay: National Air System Delay, in Minutes
  • Security Delay: Security Delay, in Minutes
  • Late Aircraft Delay: Late Aircraft Delay, in Minutes

Table 1 shows the flights distinguished by a new variable, Flight Period, where each period comprises an 8-hour segment (i.e. Morning has flights with departure times from 4am to noon, followed by Afternoon and Evening). The Afternoon period comprises the most flights (47.4%), followed closely by the Morning period (41.5%), with the Evening period trailing both (11%). The table also gives the mean departure and arrival times, giving an indication of the density of flights within each period. The average departure and arrival delays are much better for the Morning period (5.23 and -0.77 minutes), with increasing delays for the Afternoon and Evening periods. The delay counts by type show that the Afternoon and Morning periods account for significantly more of the total delays, though that is without accounting for the Evening period’s smaller share of flights overall.

Table 1: Flight Delay Summary by Flight Period
Flight Period
Morning Afternoon Evening Total
TotalFlightsCount 1246031 (41.5%) 1423140 (47.4%) 330829 (11.0%) 3000000 (100%)
CancelledFlightsCount 30690 (38.8%) 38343 (48.4%) 10107 (12.8%) 79140 (100%)
DivertedFlightsCount 2555 (36.2%) 3901 (55.3%) 600 (8.5%) 7056 (100%)
AvgCRSDepTime 08:49:31 15:73:19 20:66:23 13:27:04
AvgDepTime 08:53:58 15:89:05 20:12:40 13:29:47
AvgDepDelay 5.23 12.93 16.51 10.12
AvgTaxiOut 16.87 16.44 16.65 16.64
AvgTaxiIn 7.75 7.78 6.95 7.68
AvgCRSArrTime 10:87:15 17:85:11 17:42:14 14:90:34
AvgArrTime 10:86:01 17:71:56 15:89:47 14:66:31
AvgArrDelay -0.77 7.34 10.04 4.26
AvgAirTime 114.12 109.8 116.31 112.31
CarrierDelayCount 86824 (29.2%) 162266 (54.6%) 47861 (16.1%) 296951 (100%)
SecurityDelayCount 887 (32.1%) 1434 (52.0%) 438 (15.9%) 2759 (100%)
WeatherDelayCount 8380 (26.7%) 18758 (59.7%) 4290 (13.7%) 31428 (100%)
NASDelayCount 80604 (31.4%) 144366 (56.3%) 31507 (12.3%) 256477 (100%)
LateAircraftDelayCount 42721 (16.5%) 168902 (65.2%) 47391 (18.3%) 259014 (100%)
Summary includes morning, afternoon, and evening flight periods.

3.2 Exploratory Data Analysis

The histograms in Figure 1 illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All show a skew to the right. This makes sense for air times with a higher proportion of regional flights and the exclusion of international departures and arrivals. Shorter delays (for both arrivals and departures) being more frequent than longer delays is also to be expected.

Figure 2 shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.

Figure 3 displays the average arrival delay for flights at their origin airport using the “plotly” (Sievert 2020) and “ggmap” (Kahle and Wickham 2013) packages. This comes from the idea that a flight’s arrival delay is collinear with its departure delay, which may be influenced by its origin airport. Airport location information was sourced from Random Fractals Inc (Novak 2023).

Figures 4 and 5 show the correlation of the continuous variables and the correlation of the categorical and continuous variables, respectively, using the “corrplot” package (Wei and Simko 2024). Figure 4 doesn’t provide any particularly valuable insight. Arrival delay has an expected positive correlation with departure delay, and taxi times appear slightly positively correlated with arrival delay and elapsed time, which is also expected.

Figure 5 shows the correlation of the continuous and categorical variables using the p-values from the t-tests and ANOVAs of each relation. Relationships with at least one significant p-value are shown in dark blue. This heatmap is of limited value, as there appear to be many significant correlations, but without further inspection it is not known which value(s) of the categorical variables have a significant p-value.

Code
library(reshape2)  # melt()
library(ggplot2)

p_value_long <- melt(p_value_results, na.rm = TRUE)  # Remove NAs

ggplot(data = p_value_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "blue", high = "red", na.value = "gray90", name = "p-value") +
  labs(x = "Categorical Variable",
       y = "Continuous Variable",
       caption = "Heatmap of p-values for categorical vs continuous variables.", 
       tag = "Figure 5") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0), 
        plot.tag.position = "topleft")

3.3 Data Preprocessing

To prepare the data for analysis, a few changes were made. Diverted and cancelled flights were removed as they both have arrival delays set to β€œNA” due to them not arriving as scheduled. Including cancelled and diverted flights would introduce complexity and could lead to bias in the model. The goal of the model is to understand typical arrival delays, so the inclusion of cancelled and diverted flights is unnecessary. The dataset is also quite large with three million observations, so the dataset was sampled to account for computational efficiency. For the first model, the departure time variable was converted from an β€œHHMM” format to a β€œminutes past midnight” format for use by the stan_glm() function.
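The preprocessing steps above can be sketched in base R on a toy data frame; the column names CANCELLED and DIVERTED are assumptions based on the dataset description, while ARR_DELAY matches the code used elsewhere in this report.

```r
# Toy data frame standing in for the full BTS dataset (column names
# CANCELLED and DIVERTED are assumed from the dataset description)
flights <- data.frame(
  ARR_DELAY = c(5, NA, -3, 12, NA),
  CANCELLED = c(0, 1, 0, 0, 0),
  DIVERTED  = c(0, 0, 0, 0, 1)
)

# Remove cancelled and diverted flights, whose arrival delay is NA
flights_clean <- flights[flights$CANCELLED == 0 & flights$DIVERTED == 0, ]

# Sample for computational efficiency (here 2 of the 3 remaining rows)
set.seed(123)
flights_sampled <- flights_clean[sample(nrow(flights_clean), 2), ]

nrow(flights_clean)  # 3 rows survive the filter, none with NA delays
```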

Code
ggplot(Delays_sample, aes(x = DEP_TIME)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(caption="Distribution of departure time.",
       x = "Departure Time (time HHMM)",
       y = "Frequency",
       tag = "Figure 6") +
  xlim(NA, 2400) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0), 
        plot.tag.position = "topleft")

Code
convert_to_minutes <- function(time) {
  hour <- time %/% 100
  minute <- time %% 100
  total_minutes <- hour * 60 + minute
  return(total_minutes)
}

Delays_sample <- Delays_sample %>%
  mutate("DEP_TIME_MINS" = sapply(DEP_TIME, convert_to_minutes))
Code
ggplot(Delays_sample, aes(x = DEP_TIME_MINS)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(caption = "Distribution of adjusted departure time.",
        x = "Departure Time (mins past midnight)",
        y = "Frequency",
       tag = "Figure 7") +
  xlim(NA, 1440) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0), 
        plot.tag.position = "topleft")

For the second model, “Day of the Week” was the chosen predictor variable, which was created using the “lubridate” library (Grolemund and Wickham 2011).

Code
Delays_sample1 <- Delays_sample %>%
  mutate(FL_DATE = as.Date(FL_DATE, format = "%Y-%m-%d"))

ggplot(Delays_sample1, aes(x = FL_DATE)) +
  geom_histogram(binwidth = 30, fill = "skyblue", color = "black") +
  labs(
    caption = "Flight counts by date.",
    x = "Flight Date",
    y = "Count",
    tag = "Figure 8") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0), 
        plot.tag.position = "topleft")

Code
library(lubridate)

Delays_sample <- Delays_sample %>%
  mutate(DAY_OF_WEEK = wday(FL_DATE, label = TRUE, abbr = TRUE))


mean_delay_by_day <- Delays_sample %>%
  group_by(DAY_OF_WEEK) %>%
  summarise(mean_arr_delay = mean(ARR_DELAY),
            sd_arr_delay = sd(ARR_DELAY))
Code
ggplot(Delays_sample, aes(x = DAY_OF_WEEK, y = ARR_DELAY)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = mean_delay_by_day, aes(x = DAY_OF_WEEK, y = mean_arr_delay), 
             color = "red4", size =3, shape = 8) +
  labs(
    caption = "Arrival delay by the day of the week.",
    x = "Day of the Week",
    y = "Arrival Delay (minutes)",
    tag = "Figure 9"  ) +
  theme_minimal()+
  ylim(-45,45) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0), 
        plot.tag.position = "topleft")

3.4 Modeling

The Normal Data Model: Departure Time Predictor

For the continuous predictor, Departure Time, three normal models with different priors are sampled:

  • Flat Priors
  • Rstanarm’s Default Priors
  • Tuned Priors

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

Flat Continuous Model
Code
library(broom.mixed)
library(rstanarm)
library(bayesplot)  # mcmc_dens(), mcmc_trace(), mcmc_dens_overlay(), mcmc_acf()
library(ggpubr)

flat_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS,
                       data = Delays_sample,
                       family = gaussian(),
                       prior = NULL,
                       prior_intercept = NULL,
                       prior_aux = NULL,
                       chains = 4, iter = 2000, seed = 123,
                       refresh = 0
                       )

fmdt <- tidy(flat_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Simulate a set of predictions
set.seed(123)
shortcut_prediction1 <- 
  posterior_predict(flat_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

ggarrange(
  mcmc_dens(shortcut_prediction1) + 
  xlab("Delay (minutes)") +
  ggtitle("Figure 10 (a) Predicted Arrival Delays for a Departure Time of Noon, Flat Priors")
)

Code
model <- flat_model_dt
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 10 (b) MCMC Trace") +ylab("pi"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 10 (c) MCMC Overlay") + ylab("density") + xlab("pi"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 10 (d) Autocorrelation"))

The posterior median relationship for the first model with flat priors is

\[ -10.92 + 0.02X. \]

That is, at \(0 \text{ minutes}\) (i.e. midnight), the expected delay is \(-10.92\) minutes. For every minute past midnight, the delay is expected to increase by \(0.02 \text{ minutes}\) (\(1.2 \text{ seconds}\)). At noon (\(X = 720 \text{ minutes}\)), the expected delay for any flight is \(3.48 \text{ minutes}\) (\(\sim 3 \text{ minutes and } 29\text{ seconds}\)).
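The arithmetic behind the noon prediction is a one-liner:

```r
# Posterior median estimates from the flat-prior model
beta0 <- -10.92
beta1 <- 0.02

delay_noon <- beta0 + beta1 * 720  # X = 720 minutes past midnight (noon)
delay_noon       # 3.48 minutes
delay_noon * 60  # 208.8 seconds, i.e. about 3 minutes 29 seconds
```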

Default Continuous Model
Code
library(rstanarm)

default_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS,
                       data = Delays_sample,
                       family = gaussian(),
                       chains = 4, iter = 2000, seed = 123,
                       refresh = 0
                       )

dmdt <- tidy(default_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Simulate a set of predictions
set.seed(123)
shortcut_prediction2 <- 
  posterior_predict(default_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

ggarrange(
  mcmc_dens(shortcut_prediction2) + 
  xlab("Delay (minutes)") +
  ggtitle("Figure 11 (a) Predicted Arrival Delays for a Departure Time of Noon, Default Priors")
)

Code
model <- default_model_dt
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 11 (b) MCMC Trace") +ylab("pi"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 11 (c) MCMC Overlay") + ylab("density") + xlab("pi"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 11 (d) Autocorrelation"))

The posterior median relationship for the first model with rstanarm’s default priors is

\[ -10.94 + 0.02X. \]

This version of the model only had a small decrease in the intercept.

Tuned Continuous Model
\(\beta_0\) informs the model intercept
Code
summary(Delays_sample$DEP_TIME_MINS) #mean departure time is 809.3 minutes (~ 1:30pm)

Delays_sample_filtered_B0 <- subset(Delays_sample, DEP_TIME_MINS >= 800 & DEP_TIME_MINS <= 820)

mean(Delays_sample_filtered_B0$ARR_DELAY) #m_0c = 2
sd(Delays_sample_filtered_B0$ARR_DELAY)  #s_0c = 36

\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time. With a mean departure time at \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 36 minutes.

\[ \beta_{0c} \sim N(2, 36^2) \]

\(\beta_1\) informs the model slope
Code
lm_model <- lm(ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample)

summary(lm_model)

coef(lm_model)["DEP_TIME_MINS"] #m_1 = 0.01903
summary(lm_model)$coefficients["DEP_TIME_MINS", "Std. Error"] #s_1 = 0.0005

The slope of the linear model indicates a 0.019 minute increase in arrival delay per minute increase in departure time, so we set \(m_1 = 0.02\). The standard error reflects high confidence at 0.0005, but so as not to overly constrain the model we will widen it to \(s_1 = 0.01\).

\[ \beta_{1} \sim N(0.02, 0.01^2) \]

\(\sigma\) informs the regression standard deviation
Code
summary(lm_model)$sigma

To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error, \(\sim 50\). With this, we can find the rate parameter, \(l\).

\[ \begin{align*} E(\sigma) &= \frac{1}{l} = 50\\\\ l &= \frac{1}{50} = 0.02\\\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(2, 36^2)\\ \beta_1 &\sim N(0.02, 0.01^2)\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

Code
library(rstanarm)

tuned_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS, 
                        data = Delays_sample,
                        family = gaussian(),
                        prior_intercept = normal(2, 36),  # rstanarm's normal() scale is an SD: 36, not 36^2
                        prior = normal(0.02, 0.01),       # matches beta_1 ~ N(0.02, 0.01^2)
                        prior_aux = exponential(0.02),
                        chains = 4, iter = 2000, seed = 123,
                        refresh = 0
                        )

library(broom.mixed)

tmdt <- tidy(tuned_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Store the 4 chains for each parameter in 1 data frame
model_tuned_df <- as.data.frame(tuned_model_dt)

##TUNED
# Simulate a set of predictions
set.seed(123)
shortcut_prediction <- 
  posterior_predict(tuned_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

ggarrange(
  mcmc_dens(shortcut_prediction) + 
  xlab("Delay (minutes)") +
  ggtitle("Figure 12 (a) Predicted Arrival Delays for a Departure Time of Noon, Tuned Priors")
)

Code
model <- tuned_model_dt
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 12 (b) MCMC Trace") +ylab("pi"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 12 (c) MCMC Overlay") + ylab("density") + xlab("pi"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 12 (d) Autocorrelation"))

The posterior median relationship for the first model with manually tuned priors is

\[ -11.66 + 0.02X. \]

This version of the model had a larger decrease in the intercept (just over 43 seconds), but there was no large change in the slope.

The Normal Data Model: Week Day Predictor

The categorical predictor, Day of the Week, is given a similar treatment to the continuous predictor, with the sampling of three normal models with different priors:

  • Flat Priors
  • Rstanarm’s Default Priors
  • Tuned Priors

Tuesday is set as the reference level, as it has the lowest mean delay.

\[ \begin{align*} Y_i|\beta_0, \beta_1, ... \beta_6, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... \beta_6X_{i6} \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_j &\sim N(m_1, s_1^2), \quad j = 1, \dots, 6\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

The general formula is adjusted for the inclusion of regression coefficients for each day of the week \(\beta_1, \beta_2, ..., \beta_6\), where the intercept coefficient \(\beta_0\) reflects Tuesday’s arrival delay. Only one prior for the predictor must be specified as the stan_glm() function handles categorical variables in a similar way to the glm() function, through dummy coding.
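The dummy coding can be inspected directly with base R’s model.matrix(); this small illustrative factor (three example days, not the full dataset) shows Tuesday absorbed into the intercept:

```r
# Three example days, releveled so Tuesday is the reference category
day <- factor(c("Tue", "Wed", "Fri"),
              levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))

# model.matrix() expands the factor into 0/1 indicator (dummy) columns;
# the reference level Tue gets no column and is absorbed into the intercept.
model.matrix(~ day)
```

Each non-reference day gets its own column (dayWed, dayThu, and so on), matching the coefficients \(\beta_1, \dots, \beta_6\) in the model above.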

Flat Categorical Model
Code
Delays_sample$DAY_OF_WEEK <- factor(
  Delays_sample$DAY_OF_WEEK, 
  ordered = FALSE)

Delays_sample <- Delays_sample %>%
  mutate(DAY_OF_WEEK = relevel(as.factor(DAY_OF_WEEK), ref = "Tue"))

flat_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  prior = NULL, 
  prior_intercept = NULL, 
  prior_aux = NULL,
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)

model <- flat_model_dow

fmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)
library(tidyr)  # pivot_longer()

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))


ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean, 
               geom = "vline", 
               aes(xintercept = after_stat(x), color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Figure 13 (a) Posterior Predicted Distribution of Arrival Delay by Day of the Week, Flat Priors",
    x = "Predicted Arrival Delay (minutes)",
    y = " ",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Code
model <- flat_model_dow
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 13 (b) MCMC Trace") +ylab("pi"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 13 (c) MCMC Overlay") + ylab("density") + xlab("pi"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 13 (d) Autocorrelation"))

The posterior median relationship for the second model with flat priors is

\[ 1.57 + 4.92X_1 + 2.66X_2 + 1.86X_3 + 3.43X_4 + 4.36X_5 + 2.93X_6. \]

That is, on Tuesday one can expect a delay of \(1.57\) minutes (\(1\) minute and \(34\) seconds). One can expect the longest delay on Wednesday at \(6\) minutes and \(29\) seconds.

Default Categorical Model
Code
default_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)

model <- default_model_dow

dmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)
library(tidyr)   # for pivot_longer()

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))

ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean,
               geom = "vline",
               aes(xintercept = after_stat(x), color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Figure 14 (a) Posterior Predicted Distribution of Arrival Delay by Day of the Week, Default Priors",
    x = "Predicted Arrival Delay (minutes)",
    y = " ",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Code
model <- default_model_dow
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 14 (b) MCMC Trace") + ylab("value"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 14 (c) MCMC Overlay") + ylab("density") + xlab("value"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 14 (d) Autocorrelation"))

The posterior median relationship for the second model with default priors is

\[ 1.54 + 4.40X_1 + 2.69X_2 + 2.97X_3 + 4.96X_4 + 3.48X_5 + 1.89X_6. \]

This version of the second model showed little change in the expectations for Tuesday through Thursday, but the rest of the week shifted by more than a minute for all days except Sunday, which changed by \(53\) seconds.

Tuned Categorical Model

For arrival delays by day of the week, Figure 9 shows mean arrival delays between 1 and 7 minutes, while the median arrival delays are all negative, indicating a right skew toward larger delays.

\(\beta_0\) informs the model intercept
Code
mean_delay_by_day

\(\beta_{0}\) reflects the mean arrival delay on Tuesday, our reference level. The average arrival delay is \(\sim 2\) minutes with a standard deviation of \(\sim 46\) minutes.

\[ \beta_{0} \sim N(2, 46^2) \]

\(\beta_j\) informs the model slopes

For a categorical predictor with the stan_glm() function, the tuned prior on \(\beta_j\) is applied to the estimation of each coefficient associated with the individual levels of the predictor (\(\beta_1, \beta_2, \ldots, \beta_6\)). For this reason, we set the coefficient prior to be weakly informative.

\[ \beta_{j} \sim N(0, 50^2) \]
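To see why a single prior specification covers all six slope coefficients, note how R expands a factor predictor into indicator columns; a minimal sketch with a small hypothetical day-of-week vector:

```r
# Sketch: a 7-level factor expands into six indicator columns, and
# stan_glm() applies the beta_j prior to each indicator's coefficient.
dow <- factor(c("Tue", "Wed", "Mon"),
              levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))
X <- model.matrix(~ dow)
colnames(X)   # "(Intercept)" plus six indicators (Tue is the reference)
ncol(X) - 1   # 6 slope coefficients, one per non-reference level
```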

\(\sigma\) informs the regression standard deviation
Code
lm_model <- lm(ARR_DELAY ~ DAY_OF_WEEK, data = Delays_sample)

summary(lm_model)$sigma

To tune the exponential prior, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error, which, as with the previous model, is \(\sim 50\).

\[ \begin{align*} E(\sigma) &= \frac{1}{l} = 50\\\\ l &= \frac{1}{50} = 0.02\\\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]
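A quick numerical check of this tuning (simulation only, not part of the model fit): draws from an Exponential(0.02) distribution should average to the target of 50.

```r
# The Exponential(l) distribution has mean 1/l, so l = 1/50 = 0.02
# gives a prior mean equal to the residual standard error target.
l <- 1 / 50                  # rate
set.seed(123)
draws <- rexp(100000, rate = l)
mean(draws)                  # approximately 50
```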

The tuned model is as follows,

\[ \begin{align*} Y_i|\beta_0, \beta_1, ... \beta_6, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... \beta_6X_{i6} \\ \beta_{0} &\sim N(2, 46^2)\\ \beta_j &\sim N(0, 50^2)\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

Code
tuned_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  prior = normal(0, 50),            # beta_j ~ N(0, 50^2); rstanarm's normal() takes the SD
  prior_intercept = normal(2, 46),  # beta_0 ~ N(2, 46^2)
  prior_aux = exponential(0.02),
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)

model <- tuned_model_dow

tmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)
library(tidyr)   # for pivot_longer()

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))

ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean,
               geom = "vline",
               aes(xintercept = after_stat(x), color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Figure 15 (a) Posterior Predicted Distribution of Arrival Delay by Day of the Week, Tuned Priors",
    x = "Predicted Arrival Delay (minutes)",
    y = " ",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Code
model <- tuned_model_dow
ggarrange(mcmc_trace(model, size = 0.1) + ggtitle("Figure 15 (b) MCMC Trace") + ylab("value"))

Code
ggarrange(mcmc_dens_overlay(model) + ggtitle("Figure 15 (c) MCMC Overlay") + ylab("density") + xlab("value"))

Code
ggarrange(mcmc_acf(model) + ggtitle("Figure 15 (d) Autocorrelation"))

The posterior median relationship for the second model with manually tuned priors is

\[ 1.54 + 4.96X_1 + 2.70X_2 + 1.91X_3 + 3.50X_4 + 4.41X_5 + 2.99X_6. \]

The tuned version of the second model produced coefficients more similar to those of the flat model.

3.5 Results

Table 2. Estimations of the Posterior Distributions’ Regression Coefficients.
Mean SD 95% CI
Model 1: Continuous Predictor
Flat Priors 𝛽₀ intercept -10.92 0.47 (-11.85; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.86)
𝜎 51.09 0.12 (50.86; -11.86)
Default Tuned Priors 𝛽₀  intercept -10.94 0.47 (-11.86; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.86)
𝜎 51.09 0.12 (50.86; -12.02)
Tuned Priors 𝛽₀ intercept -11.66 0.17 (-12.02; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.87)
𝜎 51.10 0.12 (50.87; 0.00)
Model 2: Categorical Predictor
Flat Priors 𝛽₀ intercept 1.57 0.44 (0.69; 3.72)
𝛽₁ Wednesday 4.92 0.61 (3.72; 1.47)
𝛽₂ Thursday 2.66 0.65 (1.47; 0.69)
𝛽₃ Friday 1.86 0.62 (0.69; 2.25)
𝛽₄ Saturday 3.43 0.61 (2.25; 3.21)
𝛽₅ Sunday 4.36 0.61 (3.21; 1.72)
𝛽₆ Monday 2.93 0.65 (1.72; 51.16)
𝜎 51.39 0.12 (51.16; 0.73)
Default Tuned Priors 𝛽₀ intercept 1.54 0.40 (0.73; 3.21)
𝛽₁ Wednesday 4.40 0.60 (3.21; 1.48)
𝛽₂ Thursday 2.69 0.60 (1.48; 1.75)
𝛽₃ Friday 2.97 0.64 (1.75; 3.77)
𝛽₄ Saturday 4.96 0.59 (3.77; 2.31)
𝛽₅ Sunday 3.48 0.58 (2.31; 0.74)
𝛽₆ Monday 1.89 0.60 (0.74; 51.17)
𝜎 51.40 0.12 (51.17; 0.66)
Tuned Priors 𝛽₀ intercept 1.54 0.44 (0.66; 3.76)
𝛽₁ Wednesday 4.96 0.64 (3.76; 1.48)
𝛽₂ Thursday 2.70 0.61 (1.48; 0.71)
𝛽₃ Friday 1.91 0.64 (0.71; 2.28)
𝛽₄ Saturday 3.50 0.63 (2.28; 3.23)
𝛽₅ Sunday 4.41 0.61 (3.23; 1.73)
𝛽₆ Monday 2.99 0.64 (1.73; 51.17)
𝜎 51.39 0.12 (51.17; 0.00)

4 Discussion

4.1 Comparison of the Models

Cross validation was done to provide an estimate of model performance. The posterior predictive summary comprises the median absolute error (MAE), which measures the typical difference between the observed values and the posterior predictive means; the scaled MAE (scaled_MAE), which scales the MAE by the posterior predictive standard deviations; and the proportions of observations that fall within the \(50\%\) and \(95\%\) posterior prediction intervals. The cross validation assessment (prediction_summary_cv()) was chosen over the posterior prediction summary assessment (prediction_summary()), both from the β€œbayesrules” package (Dogucu, Johnson, and Ott 2021), because the cross validation procedure provides a more conservative estimate of model performance. Cross validation can also be made more computationally efficient by using a lower number of folds and a smaller dataset.
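As an illustration of what these summaries measure (a sketch of the definitions, not the bayesrules internals), the interval-coverage and error metrics can be computed directly from a draws-by-observations matrix like the one posterior_predict() returns:

```r
# Sketch of the reported metrics, computed from a matrix `post` with one
# row per posterior draw and one column per observation.
summarize_draws <- function(post, y) {
  pred_center <- apply(post, 2, median)
  q <- function(p) apply(post, 2, quantile, probs = p)
  data.frame(
    mae       = median(abs(y - pred_center)),
    within_50 = mean(y >= q(0.25)  & y <= q(0.75)),
    within_95 = mean(y >= q(0.025) & y <= q(0.975))
  )
}

# Toy example with simulated "draws" for 3 observations:
set.seed(1)
post <- matrix(rnorm(3000), nrow = 1000, ncol = 3)  # 1000 draws x 3 obs
summarize_draws(post, y = c(0, 0, 0))
```

In the cross-validated version, each fold's held-out observations are summarized this way and the fold results are averaged.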

Code
library(bayesrules)

# #Posterior Predictive Summary
# 
# set.seed(123)
# ps_fmdt <- prediction_summary(flat_model_dt, data = Delays_sample)
# 
# set.seed(123)
# ps_dmdt <- prediction_summary(default_model_dt, data = Delays_sample)
# 
# set.seed(123)
# ps_tmdt <- prediction_summary(tuned_model_dt, data = Delays_sample)
# 
# ps_fmdt <- ps_fmdt %>% 
#   mutate(Model = "fmdt")
# ps_dmdt <- ps_dmdt %>% 
#   mutate(Model = "dmdt")
# ps_tmdt <- ps_tmdt %>% 
#   mutate(Model = "tmdt")
# 
# ps_results <- bind_rows(ps_fmdt, ps_dmdt, ps_tmdt)
# 
# print(ps_results)

#K-fold Cross Validation with k=5, data n=1000

set.seed(123)
sample_size <- 1000
Delays_sample2 <- Delays_sample %>% 
  sample_n(sample_size)

set.seed(123)
cv_fmdt <- prediction_summary_cv(
  model = flat_model_dt, data = Delays_sample2, k = 5)

set.seed(123)
cv_dmdt <- prediction_summary_cv(
  model = default_model_dt, data = Delays_sample2, k = 5)

set.seed(123)
cv_tmdt <- prediction_summary_cv(
  model = tuned_model_dt, data = Delays_sample2, k = 5)

cv_fmdt <- cv_fmdt$cv %>% 
  mutate(Model = "fmdt")
cv_dmdt <- cv_dmdt$cv %>% 
  mutate(Model = "dmdt")
cv_tmdt <- cv_tmdt$cv %>% 
  mutate(Model = "tmdt")

cv_results <- bind_rows(cv_fmdt, cv_dmdt, cv_tmdt)

print(cv_results)
Code
#Posterior Predictive Summary
# 
# set.seed(123)
# ps_fmdow <- prediction_summary(flat_model_dow, data = Delays_sample)
# 
# set.seed(123)
# ps_dmdow <- prediction_summary(default_model_dow, data = Delays_sample)
# 
# set.seed(123)
# ps_tmdow <- prediction_summary(tuned_model_dow, data = Delays_sample)
# 
# ps_fmdow <- ps_fmdow %>% 
#   mutate(Model = "fmdow")
# ps_dmdow <- ps_dmdow %>% 
#   mutate(Model = "dmdow")
# ps_tmdow <- ps_tmdow %>% 
#   mutate(Model = "tmdow")
# 
# ps_results2 <- bind_rows(ps_fmdow, ps_dmdow, ps_tmdow)
# 
# print(ps_results2)

#K-fold Cross Validation with k=5, data n=1000

set.seed(123)
sample_size <- 1000
Delays_sample2 <- Delays_sample %>% 
  sample_n(sample_size)

set.seed(123)
cv_fmdow <- prediction_summary_cv(
  model = flat_model_dow, data = Delays_sample2, k = 5)

set.seed(123)
cv_dmdow <- prediction_summary_cv(
  model = default_model_dow, data = Delays_sample2, k = 5)

set.seed(123)
cv_tmdow <- prediction_summary_cv(
  model = tuned_model_dow, data = Delays_sample2, k = 5)

cv_fmdow <- cv_fmdow$cv %>% 
  mutate(Model = "fmdow")
cv_dmdow <- cv_dmdow$cv %>% 
  mutate(Model = "dmdow")
cv_tmdow <- cv_tmdow$cv %>% 
  mutate(Model = "tmdow")

cv_results2 <- bind_rows(cv_fmdow, cv_dmdow, cv_tmdow)

print(cv_results2)
Table 3. Posterior predictive results from cross validation.
MAE MAE Scaled Within 50% Within 95%
Β Β Β Β Β Model 1: Continuous Predictor
Flat Priors 15.730 0.313 0.841 0.966
Default Tuned Priors 15.779 0.314 0.840 0.966
Tuned Priors 15.668 0.312 0.849 0.966
Β Β Β Β Β Model 2: Categorical Predictor
Flat Priors 17.110 0.338 0.866 0.965
Default Tuned Priors 17.080 0.338 0.866 0.966
Tuned Priors 17.118 0.339 0.867 0.966
Code
pp_check(flat_model_dt, nreps = 50) + ggtitle("Posterior Predictive Check: fmdt")

Code
pp_check(default_model_dt, nreps = 50) + ggtitle("Posterior Predictive Check: dmdt")

Code
pp_check(tuned_model_dt, nreps = 50) + ggtitle("Posterior Predictive Check: tmdt")

Code
pp_check(flat_model_dow, nreps = 50) + ggtitle("Posterior Predictive Check: fmdow")

Code
pp_check(default_model_dow, nreps = 50) + ggtitle("Posterior Predictive Check: dmdow")

Code
pp_check(tuned_model_dow, nreps = 50) + ggtitle("Posterior Predictive Check: tmdow")

Code
fmdt_neff <- neff_ratio(flat_model_dt)
dmdt_neff <- neff_ratio(default_model_dt)
tmdt_neff <- neff_ratio(tuned_model_dt)
fmdow_neff <- neff_ratio(flat_model_dow)
dmdow_neff <- neff_ratio(default_model_dow)
tmdow_neff <- neff_ratio(tuned_model_dow)
 
fmdt_neff <- as_tibble_row(fmdt_neff) %>%
  mutate(Model = "fmdt")
dmdt_neff <- as_tibble_row(dmdt_neff) %>% 
  mutate(Model = "dmdt")
tmdt_neff <- as_tibble_row(tmdt_neff) %>% 
  mutate(Model = "tmdt")

fmdow_neff <- as_tibble_row(fmdow_neff) %>% 
  mutate(Model = "fmdow")
dmdow_neff <- as_tibble_row(dmdow_neff) %>% 
  mutate(Model = "dmdow")
tmdow_neff <- as_tibble_row(tmdow_neff) %>% 
  mutate(Model = "tmdow")
 
neff_table_m1 <- bind_rows(fmdt_neff, dmdt_neff, tmdt_neff)

neff_table_m2 <- bind_rows(fmdow_neff, dmdow_neff, tmdow_neff)

neff_table_m1
neff_table_m2
Table 4. Effective sample size ratios for Model 1.
Flat Priors 𝛽₀ intercept 0.83
𝛽₁ Departure Time 1.19
𝜎 0.48
Default Tuned Priors 𝛽₀  intercept 0.80
𝛽₁ Departure Time 1.10
𝜎 0.80
Tuned Priors 𝛽₀ intercept 0.64
𝛽₁ Departure Time 0.71
𝜎 1.04
Table 5. Effective sample size ratios for Model 2.
Flat Priors 𝛽₀ intercept 0.27
𝛽₁ Wednesday 0.36
𝛽₂ Thursday 0.38
𝛽₃ Friday 0.36
𝛽₄ Saturday 0.37
𝛽₅ Sunday 0.35
𝛽₆ Monday 0.36
𝜎 2.92
Default Tuned Priors 𝛽₀ intercept 0.37
𝛽₁ Wednesday 0.60
𝛽₂ Thursday 0.61
𝛽₃ Friday 0.56
𝛽₄ Saturday 0.58
𝛽₅ Sunday 0.59
𝛽₆ Monday 0.55
𝜎 0.66
Tuned Priors 𝛽₀ intercept 0.47
𝛽₁ Wednesday 0.97
𝛽₂ Thursday 0.89
𝛽₃ Friday 1.00
𝛽₄ Saturday 0.95
𝛽₅ Sunday 0.97
𝛽₆ Monday 0.93
𝜎 0.21
Code
fmdt_rhat <- rhat(flat_model_dt)
dmdt_rhat <- rhat(default_model_dt)
tmdt_rhat <- rhat(tuned_model_dt)
fmdow_rhat <- rhat(flat_model_dow)
dmdow_rhat <- rhat(default_model_dow)
tmdow_rhat <- rhat(tuned_model_dow)
 
fmdt_rhat <- as_tibble_row(fmdt_rhat) %>%
  mutate(Model = "fmdt")
dmdt_rhat <- as_tibble_row(dmdt_rhat) %>% 
  mutate(Model = "dmdt")
tmdt_rhat <- as_tibble_row(tmdt_rhat) %>% 
  mutate(Model = "tmdt")

fmdow_rhat <- as_tibble_row(fmdow_rhat) %>% 
  mutate(Model = "fmdow")
dmdow_rhat <- as_tibble_row(dmdow_rhat) %>% 
  mutate(Model = "dmdow")
tmdow_rhat <- as_tibble_row(tmdow_rhat) %>% 
  mutate(Model = "tmdow")
 
rhat_table_m1 <- bind_rows(fmdt_rhat, dmdt_rhat, tmdt_rhat)
rhat_table_m2 <- bind_rows(fmdow_rhat, dmdow_rhat, tmdow_rhat)
 
rhat_table_m1
rhat_table_m2
Table 6. R-hat metric for Model 1.
Flat Priors 𝛽₀ intercept 0.9997
𝛽₁ Departure Time 0.9996
𝜎 1.0015
Default Tuned Priors 𝛽₀  intercept 1.0008
𝛽₁ Departure Time 1.0005
𝜎 1.0004
Tuned Priors 𝛽₀ intercept 1.0004
𝛽₁ Departure Time 0.9996
𝜎 1.0004
Table 7. R-hat metric for Model 2.
Flat Priors 𝛽₀ intercept 1.0045
𝛽₁ Wednesday 1.0027
𝛽₂ Thursday 1.0027
𝛽₃ Friday 1.0028
𝛽₄ Saturday 1.0012
𝛽₅ Sunday 1.0036
𝛽₆ Monday 1.0024
𝜎 0.9994
Default Tuned Priors 𝛽₀ intercept 1.0015
𝛽₁ Wednesday 1.0001
𝛽₂ Thursday 1.0015
𝛽₃ Friday 0.9996
𝛽₄ Saturday 1.0010
𝛽₅ Sunday 0.9995
𝛽₆ Monday 1.0002
𝜎 0.9995
Tuned Priors 𝛽₀ intercept 1.0003
𝛽₁ Wednesday 0.9995
𝛽₂ Thursday 1.0002
𝛽₃ Friday 0.9997
𝛽₄ Saturday 0.9998
𝛽₅ Sunday 1.0008
𝛽₆ Monday 0.9996
𝜎 1.0056

4.2 Evaluation of the Models

Tip

All notes past this point are for future reference

Considerations for the Dataset
  • how was the data collected, by whom, and for what purpose; what impact might the results have; what biases are baked into the analyses (scope of the data)
Checking Model Assumptions

β€œNormal regression assumptions

The appropriateness of the Bayesian Normal regression model (9.3) depends upon the following assumptions.

  • Structure of the data
    • Accounting for predictor \(X\), the observed data \(Y_i\) on case \(i\) is independent of the observed data on any other case \(j\).
  • Structure of the relationship
    • The typical \(Y\) outcome can be written as a linear function of predictor X, \(ΞΌ = Ξ²_0 + Ξ²_1X\).
  • Structure of the variability
    • At any value of predictor \(X\), the observed values of \(Y\) will vary normally around their average \(ΞΌ\) with consistent standard deviation \(Οƒ\).”

Assumption 1: though formal hypothesis tests exist, appropriateness is typically evaluated from the data collection context (https://www.bayesrulesbook.com/chapter-10)

Assumption 2 & 3: scatterplot of raw data, pp_check

# Examine 50 of the 20000 simulated samples
pp_check(bike_model, nreps = 50) + 
  xlab("rides")
Accuracy of Posterior Predictive Models
  • discussion on MCMC simulations -> trace, overlay, acf

  • check that \(N_{eff} > 0.1 \times N\) using neff_ratio()

5 Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. β€œJoint Longitudinal and Time-to-Event Models via Stan.” https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Grolemund, Garrett, and Hadley Wickham. 2011. β€œDates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Kahle, David, and Hadley Wickham. 2013. β€œGgmap: Spatial Visualization with Ggplot2.” The R Journal 5 (1): 144–61. https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/https://doi.org/10.1002/9781119942412.
Novak, Taras. 2023. β€œGeo Data Viewer: USA Airports Dataset.” https://github.com/RandomFractals/geo-data-viewer/blob/master/data/csv/usa-airports.csv.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Wei, Taiyun, and Viliam Simko. 2024. R Package ’Corrplot’: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Yan, Xin, and Xiao Gang Su. 2009. Linear Regression Analysis: Theory and Computing. Singapore: World Scientific Publishing. https://ebookcentral.proquest.com/lib/uwf/reader.action?docID=477274&ppg=318&pq-origsite=primo.
Zelazko, Patrick. 2023. β€œFlight Delay and Cancellation Dataset (2019-2023).” https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023.